High-dimensional Sparse Count Data Clustering Using Finite Mixture Models
Due to the massive amount of available digital data, automating its analysis and modeling for
different purposes and applications has become an urgent need. One of the most challenging tasks
in machine learning is clustering, which is defined as the process of assigning observations sharing
similar characteristics to subgroups. Such a task is significant, especially in implementing complex
algorithms to deal with high-dimensional data. Thus, the advancement of computational power in
statistical-based approaches is increasingly becoming an interesting and attractive research domain.
Among the successful methods, mixture models have been widely acknowledged and successfully
applied in numerous fields, as they provide a convenient yet flexible formal setting for
unsupervised and semi-supervised learning. An essential problem with these approaches is to develop
a probabilistic model that represents the data well by taking into account its nature. Count
data are widely used in machine learning and computer vision applications where an object, e.g.,
a text document or an image, can be represented by a vector corresponding to the appearance frequencies
of words or visual words, respectively. Such data usually suffer from the well-known
curse of dimensionality, as objects are represented by high-dimensional and sparse vectors, i.e., a
few thousand dimensions with a sparsity of 95 to 99%, which dramatically degrades the performance of clustering
algorithms. Moreover, count data systematically exhibit the burstiness and overdispersion
phenomena, neither of which can be handled by a generic multinomial distribution, typically
used to model count data, due to its independence assumption.
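The burstiness argument above can be illustrated numerically. The following is a minimal sketch, not taken from the thesis: it compares the log-probability of a "bursty" document (one word repeated) under a plain multinomial and under a Dirichlet compound multinomial (the multinomial with a Dirichlet prior integrated out). The vocabulary size, the uniform word probabilities, and the value alpha = 0.1 are illustrative assumptions only.

```python
import math

def multinomial_logpmf(x, p):
    """Log-probability of count vector x under a multinomial with word probabilities p."""
    n = sum(x)
    out = math.lgamma(n + 1)
    for xi, pi in zip(x, p):
        out -= math.lgamma(xi + 1)
        if xi:
            out += xi * math.log(pi)
    return out

def dcm_logpmf(x, alpha):
    """Log-probability of x under a Dirichlet compound multinomial with parameters alpha."""
    n = sum(x)
    a = sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(a + n)
    for xi, ai in zip(x, alpha):
        out += math.lgamma(xi + ai) - math.lgamma(ai) - math.lgamma(xi + 1)
    return out

# Assumed toy setting: 4-word vocabulary, both models have the same mean (uniform).
p = [0.25, 0.25, 0.25, 0.25]
alpha = [0.1, 0.1, 0.1, 0.1]   # small alpha favours repeated words

bursty = [4, 0, 0, 0]          # one word occurs four times
spread = [1, 1, 1, 1]          # every word occurs once

for name, doc in (("bursty", bursty), ("spread", spread)):
    print(name, multinomial_logpmf(doc, p), dcm_logpmf(doc, alpha))
```

In this toy setting, the compound model assigns the bursty document a much higher probability than the multinomial does, and the evenly spread document a lower one, which is the behavior the abstract attributes to hierarchical priors.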
This thesis is constructed around six related manuscripts, in which we propose several approaches
for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that have the ability to model the dependency of repetitive
word occurrences. In such frameworks, a suitable distribution is used to introduce the prior
information into the construction of the statistical model, based on a conjugate distribution to the
multinomial, e.g. the Dirichlet, generalized Dirichlet, and the Beta-Liouville, which has numerous
computational advantages. We also proposed a novel model, which we call the Multinomial
Scaled Dirichlet (MSD), that uses the scaled Dirichlet as a prior to the multinomial to allow
more modeling flexibility. Although these frameworks can model burstiness and overdispersion
well, they share a disadvantage: their estimation procedures are very inefficient when
the collection size is large. To handle high dimensionality, we considered two approaches. First,
we derived close approximations to the distributions in a hierarchical structure to bring them to
the exponential-family form, aiming to combine the flexibility and efficiency of these models with
the desirable statistical and computational properties of the exponential family of distributions, including
sufficiency, which reduce the complexity and computational effort, especially for sparse
and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach
for count data to overcome several issues that may be caused by the high dimensionality of
the feature space, such as over-fitting, low efficiency, and poor performance.
Furthermore, we handled two significant aspects of mixture-based clustering methods, namely,
parameter estimation and model selection. We considered the Expectation-Maximization
(EM) algorithm, a broadly applicable iterative algorithm for estimating mixture model
parameters, and incorporated several techniques to reduce its dependency on initialization and its convergence to poor
local maxima. For model selection, we investigated different approaches to find the optimal number
of components based on the Minimum Message Length (MML) philosophy. The effectiveness of
our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate
speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV
shows, facial expression recognition, face identification, and age estimation.
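As a rough illustration of the EM procedure mentioned above, the sketch below fits a mixture of plain multinomials to a handful of count vectors. It is a simplified stand-in, not the thesis's method: the thesis uses hierarchical (Dirichlet-type) priors, MML-based model selection, and its own initialization techniques, none of which are reproduced here. The function names, the smoothing constant, and the naive initialization from evenly spaced documents are all assumptions made for this example.

```python
import math

def init_from_docs(docs, k):
    """Naive initialisation (an assumption): seed each component from an
    evenly spaced document, with add-one smoothing."""
    n, d = len(docs), len(docs[0])
    probs = []
    for j in range(k):
        x = docs[j * n // k]
        total = sum(x) + d
        probs.append([(c + 1) / total for c in x])
    return probs

def em_multinomial_mixture(docs, k, n_iter=30, smooth=1e-2):
    """EM for a k-component multinomial mixture over count vectors."""
    n, d = len(docs), len(docs[0])
    weights = [1.0 / k] * k
    probs = init_from_docs(docs, k)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibilities, computed in log space.
        # The multinomial coefficient is the same for every component, so it is omitted.
        resp = []
        for x in docs:
            logp = [
                math.log(weights[j])
                + sum(c * math.log(probs[j][i]) for i, c in enumerate(x) if c)
                for j in range(k)
            ]
            top = max(logp)
            norm = top + math.log(sum(math.exp(v - top) for v in logp))
            resp.append([math.exp(v - norm) for v in logp])
        # M-step: re-estimate mixing weights and word probabilities (lightly smoothed).
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = (nj + smooth) / (n + k * smooth)
            counts = [
                smooth + sum(r[j] * x[i] for r, x in zip(resp, docs))
                for i in range(d)
            ]
            total = sum(counts)
            probs[j] = [c / total for c in counts]
    return weights, probs, resp

# Toy corpus (made up): two groups of documents using disjoint halves of the vocabulary.
docs = [[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
        [0, 0, 5, 4], [0, 0, 3, 6], [0, 0, 4, 5]]
weights, probs, resp = em_multinomial_mixture(docs, k=2)
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
print(labels)
```

The smoothing term keeps every word probability strictly positive, so the logarithms in the E-step stay finite even on very sparse vectors.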
Evaluating the Dynamics of Knowledge-Based Network Through Simulation: The Case of Canadian Nanotechnology Industry
Collaboration is a major factor in knowledge and innovation creation in emerging science-driven industries such as nanotechnology, where the technology is rapidly changing and constantly evolving. The scientific collaborations among individuals and organizations form a knowledge co-creation network within which information is shared, innovative ideas are exchanged, and new knowledge is generated. Although various simulation attempts have been carried out recently to analyze the performance of such networks at the firm level, the individual level has not yet been much explored in the literature.
The objective of this thesis is to investigate the role of individual scientists and their collaborations in enhancing knowledge flows, and consequently scientific production, among Canadian nanotechnology scientists. The methodology involves two main phases. First, in order to understand the collaborative behavior of scientists in the real world, data on all nanotechnology journal publications in Canada were extracted from the Scopus database, and the scientists' research performance and partnership history were analyzed using social network analysis. Moreover, the predominant properties that make a scientist sufficiently attractive to be selected as a research partner were determined using data mining and through a questionnaire sent directly to researchers selected from our database. In the second phase, an agent-based model was developed in NetLogo to simulate the knowledge-based network, in which several factors regarding the ratio, presence, and absence of various categories of scientists could be controlled.
It was found that scientists in central positions in such a network have a considerable positive impact on knowledge flows, while loyalty and cliquishness negatively affect knowledge transmission. Star scientists appear to play a substitutive role in the network, as the most famous and trusted partners to be selected when usual collaborators are scarce or missing. In addition, changes in the performance of some categories of scientists in the absence of others were also observed.
The major contribution of this work is that the developed simulation model is the first to be fully based on real data and on the observed behavior of scientists in a knowledge-based network.
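The notions of central position and cliquishness used above can be made concrete with a small co-authorship example. The sketch below is illustrative only: the abstract does not state which network measures were used, so degree centrality and the local clustering coefficient are assumed here as common proxies, and the author lists are made up.

```python
from itertools import combinations

def coauthor_graph(papers):
    """Build an undirected co-authorship graph from lists of author names."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other scientists each scientist has collaborated with."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(nb) / (n - 1) for v, nb in graph.items()}

def clustering_coefficient(graph, v):
    """Share of a scientist's collaborator pairs who have also collaborated
    with each other (assumed proxy for cliquishness)."""
    nb = sorted(graph[v])
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nb, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical publication records: each inner list is one paper's authors.
papers = [["A", "B", "C"], ["A", "D"], ["A", "E"]]
graph = coauthor_graph(papers)
centrality = degree_centrality(graph)
print(centrality["A"], clustering_coefficient(graph, "A"))
print(centrality["B"], clustering_coefficient(graph, "B"))
```

In this made-up network, scientist A is highly central but has a low clustering coefficient, whereas B sits in a tight clique with few outside ties.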